Predicting the top hits of 2019

Our approach

Description

For this challenge, we have decided to explore several approaches in order to predict the top hits of 2019 :

  • the nationality of the top artists of 2018
  • the countries in which the sales force should be focused
  • the hottest artists to consider for a featuring
  • how to write the lyrics
  • what kind of music is expected by the market
  • the influence of a good music video clip

Data sources

We have considered various data sources for our research :

  • Spotify API : extract top charts of the year, and the most important features of each song
  • Genius.com : extract the lyrics of the top songs of the year 2018 to analyze several features
  • https://www.usnews.com/news/best-countries/slideshows/top-10-most-musical-countries? to estimate the biggest countries for markets to consider
  • Webscrsaping of Pure People and Famous People to extract the number of news and shares regarding the top artists
  • Web Scraping of Social Blade to extract content about YouTube Channels
  • Google Trends for the trends regarding music genres

Open Source libraries used

For this challenge, we considered 3 main open source libraries :

  • pyAudioAnalysis for sound processing
  • NLTK for Natural Language Processing
  • lyricsgenius to extract information from the Genius API

Imports

In [158]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import plotly.plotly as py
import plotly.graph_objs as go
import IPython.display as ipd
import requests
import lyricsgenius as genius
from glob import glob
import os.path as op
from nltk.corpus import stopwords
import re
import itertools
from wordcloud import WordCloud, STOPWORDS
from sklearn import model_selection
from sklearn.svm import LinearSVC
import pickle
from bs4 import BeautifulSoup
import seaborn as sns
from matplotlib import rc
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import metrics
import requests, json, logging
import pandas as pd
import base64
import six
from sklearn import  linear_model

Which country should you focus on ?

In [97]:
# Load datas from the Spotify Top Charts : https://spotifycharts.com/regional

folder = "/Users/maelfabien/TelecomParisTech/INFMDI721/Hackathon/Total/"
onlyfiles = [f for f in os.listdir(folder) if os.path.isfile(os.path.join(folder, f))]
print("Working with {0} csv files".format(len(onlyfiles)))
Working with 33 csv files
In [98]:
#Append all files

data = []

for file in onlyfiles :
    if file != '.DS_Store' :
        df = pd.read_csv(folder + file, skiprows=1)
        df['country'] = file[9:11]
        df['week'] = file[19:20]
        data.append(df)

data = pd.concat(data, axis=0)
In [100]:
data.head(10)
Out[100]:
Position Track Name Artist Streams URL country week
0 1 thank u, next Ariana Grande 23636303 https://open.spotify.com/track/2rPE9A1vEgShuZx... us 1
1 2 BAD! XXXTENTACION 12056995 https://open.spotify.com/track/4CH66Rxcjcj3VBH... us 1
2 3 Mo Bamba Sheck Wes 10919737 https://open.spotify.com/track/1xzBco0xcoJEDXk... us 1
3 4 SICKO MODE Travis Scott 10216172 https://open.spotify.com/track/2xLMifQCjDGFmkH... us 1
4 5 ZEZE (feat. Travis Scott & Offset) Kodak Black 9256307 https://open.spotify.com/track/7l3E7lcozEodtVs... us 1
5 6 Drip Too Hard (Lil Baby & Gunna) Lil Baby 8978592 https://open.spotify.com/track/78QR3Wp35dqAhFE... us 1
6 7 Without Me Halsey 8880320 https://open.spotify.com/track/5p7ujcrUXASCNwR... us 1
7 8 Sunflower - Spider-Man: Into the Spider-Verse Post Malone 8026446 https://open.spotify.com/track/1A6OTy97kk0mMdm... us 1
8 9 Armed And Dangerous Juice WRLD 7382741 https://open.spotify.com/track/6SAKXCj5jyF6IgP... us 1
9 10 Lucid Dreams Juice WRLD 6690533 https://open.spotify.com/track/0s3nnoMeVWz3989... us 1
In [101]:
#Groups datas by country, over all weeks collected : 4 weeks
data_country = data.groupby(['country'])[["Streams"]].sum().sort_values('Streams', ascending=True)
In [102]:
#Group datas by Artist, over all weeks collected : 4 weeks
data_artists = data.groupby(['Artist'])[["Streams"]].sum().sort_values('Streams', ascending=False)
In [104]:
trace0 = go.Bar(
    x=data_country.index,
    y=data_country['Streams'],
    text=data_country['Streams'],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data = [trace0]
layout = go.Layout(
    title='Regions of the world that consume most music',
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='text-hover-bar')
Out[104]:

Over the past month, the US market has consumed as much music streaming as Germany, Great Britain, France, Astralia, Netherlands, Canada and Japan together.

In [105]:
#plt.bar(data_country.index, data_country['Streams'])
#plt.title('Number of monthly streams')
#plt.show()

Who should you sing with ?

In [108]:
trace0 = go.Bar(
    x=data_artists.head(100).sort_values('Streams', ascending=True).index,
    y=data_artists.head(100).sort_values('Streams', ascending=True).Streams,
    text=data_artists.sort_values('Streams', ascending=True).Streams,
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data = [trace0]
layout = go.Layout(
    title='Sales by most popular artist this month in top 10 regions',
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='text-hover-bar')
Out[108]:

This past month, the most popular artists in the charts were :

  • XXX Tentacion
  • Post Malone
  • Khalid
  • Ariana Grande
  • Juice WRLD
  • Lil Baby
  • Drake
  • ...
In [107]:
ipd.Audio('ariana_grande_-_thank_u__next.wav') # load a local WAV file
Out[107]:

What should your nationality be ?

Based on this list : https://www.thefamouspeople.com/singers.php, we have scrapped the most relevant datas regarding the nationality of the most famous artists.

In [145]:
def _handle_request_result_and_build_soup(request_result):
    if request_result.status_code == 200:
        html_doc =  request_result.text
        soup = BeautifulSoup(html_doc,"html.parser")
        return soup

def _convert_string_to_int(string):
    if "K" in string:
        string = string.strip()[:-1]
        return float(string.replace(',','.'))*1000
    else:
        return int(string.strip())

def get_all_links_for_query(query):
    url = website + "/rechercher/"
    res = requests.post(url, data = {'q': query })
    soup = _handle_request_result_and_build_soup(res)
    specific_class = "c-article-flux__title"
    all_links = map(lambda x : x.attrs['href'] , soup.find_all("a", class_= specific_class))

    return all_links

def get_share_count_for_page(page_url):
    res = requests.get(page_url)
    soup = _handle_request_result_and_build_soup(res)
    specific_class = "c-sharebox__stats-number"
    share_count_text = soup.find("span", class_= specific_class).text
    return  _convert_string_to_int(share_count_text)


def get_popularity_for_people(query):  
    url_people = get_all_links_for_query(query)
    results_people = []
    
    for url in url_people:
        results_people.append(get_share_count_for_page(website_prefix + url))
    
    return sum(results_people)

def get_name_nationality(page_url):
    res = requests.get(page_url)
    soup = _handle_request_result_and_build_soup(res)
    specific_class = "btn btn-primary btn-sm btn-block btn-block-margin"
    share_count_text = soup.find("a", class_= specific_class).text
    return  share_count_text
In [143]:
artists_dict = {}

for i in range(1, 17):
    website = 'https://www.thefamouspeople.com/singers.php?page='+str(i)
    
    res = requests.get(website)
    specific_class = "btn btn-primary btn-sm btn-block btn-block-margin"
    soup = _handle_request_result_and_build_soup(res)
    classes = soup.find_all("a", class_= specific_class)
    
    for i in classes:
        mini_array = i.text[:-1].split('(')
        artists_dict[mini_array[0]]=mini_array[1]
    
artists_df = pd.DataFrame.from_dict(artists_dict, orient='index', columns=['Country'])
artists_df.head(n=10)
Out[143]:
Country
Louis Tomlinson British
Aretha Franklin American
Mac Miller American
Freddie Mercury British
Ariana Grande American
Eminem American
Lady Gaga American
Nick Jonas American
6ix9ine American
Post Malone American
In [109]:
data = [go.Bar(
           x=[973, 151, 62, 49, 35, 14, 13, 12, 11, 9, 9, 9, 8, 8, 7, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4],
           y=['American', 'British', 'South Korean', 'Canadian', 'Australian', 'French', 'Filipino', 'Chinese', 'Indian', 'German', 'Irish', 'Argentinian', 'Jamaican', 'Mexican', 'Spanish', 'Swedish', 'Welsh', 'Puerto Rican', 'Japanese', 'Norwegian', 'Italian', 'South African', 'Nigerian', 'Scottish', 'Pakistani', 'Colombian'],
           orientation = 'h'
)]

py.iplot(data, filename='horizontal-bar')
Out[109]:

It looks pretty clear that being american should help.

In [110]:
#plt.figure(figsize=(12,5))
#plt.bar(data_artists.head(100).index, data_artists['Streams'].head(100))
#plt.title('Number of monthly streams per artist')
#plt.show()

What kind of music should you play ?

In [166]:
genres = pd.read_csv('genres.csv', skiprows=1)
In [183]:
genres.plot(figsize=(15,10))
plt.show()

Rap music has over the past years invaded the music industry, and Google Trends reflect it.

What kind of lyrics should you write ?

In [1]:
payload={
'genius_client_id' : 'G7eVkDc7DNvNaEZh7BobDATAa3pZ6SYmLoZgkCkIK0iw6FxDJfaYuBf6u8SjLr8O',
'genius_secret_id' : 'Fv5hJi3WCP6a2N8OUxE61bJ7Y7m9cblQfysSlxt8GhaUTXoSAlndi6l1OIlE_uESb3Cycv64Eb_UrFZp46C-Pw',
'genius_client_access_token' : 'VGxZYl4kHnoBcj_hMiUA0DtweOQvySa8c7hi_fvyqbKd__3or_Lkn75yCG6_immb'}

base_url = 'https://api.genius.com/'

r = requests.get(base_url, params=payload)
print(r.status_code) #200 is good
/anaconda3/lib/python3.6/site-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.21.1) or chardet (2.3.0) doesn't match a supported version!
  RequestsDependencyWarning)
403
In [111]:
#From the Spotify top Charts between December 16 until now :
tracks = pd.read_csv('Tracks.csv')
In [115]:
tracks = tracks.sort_values('Streams', ascending = False)
tracks.head()
Out[115]:
Unnamed: 0 Position Track Name Artist Streams
673 673 1 God's Plan Drake 54061893
200 200 1 Shape of You Ed Sheeran 51124648
542 542 1 rockstar Post Malone 46473893
1135 1135 1 thank u, next Ariana Grande 45530537
718 718 2 Psycho (feat. Ty Dolla $ign) Post Malone 45142484

To grab the lyrics from those top hits, we are using the Genius.com API-

In [114]:
api = genius.Genius('VGxZYl4kHnoBcj_hMiUA0DtweOQvySa8c7hi_fvyqbKd__3or_Lkn75yCG6_immb')
In [ ]:
%%capture
#Grab lyrics
i = 0
for track in zip(tracks['Track Name'], tracks['Artist']) :
    try :
        song = api.search_song(str(track[0]), str(track[1]))
        song.save_lyrics('/Users/maelfabien/TelecomParisTech/INFMDI721/Hackathon/New_songs/' + str(track[1] + str(i)))
    except : 
        pass
    i = i + 1
In [118]:
files = sorted(glob(op.join('/Users/maelfabien/TelecomParisTech/INFMDI721/Hackathon/New_songs/', '*.txt')))
songs = [open(f).read() for f in files]
In [119]:
#Clean extracted text
for i in range(0, len(songs)) :
    songs[i] = songs[i].replace("\n", " ").replace("\'", " ")
    songs[i] = re.sub(r"\[(.*?)\]", " ", songs[i])
In [121]:
#Remove stopwords
cachedStopWords = stopwords.words("english")

words = []
filtered = []


for i in range(0, len(songs)) :
    words.append(re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", songs[i]))
    filtered.append(' '.join([word for word in songs[i].split() if word not in cachedStopWords]))

    #filtered.append([word for word in songs[i] if word not in stopwords.words('english')])  
In [122]:
#Average number of words
voc = []
voc_unique = []
richness = []

for i in range(0, len(songs)) :
    voc.append(len(words[i]))
    voc_unique.append(len(filtered[i].split()))
    richness.append(len(filtered[i].split()) / len(words[i]))

How many words should you pronounce in a single song ?

In [124]:
plt.figure(figsize=(12,5))
plt.hist(np.array(voc), bins=40)
plt.title('Number of words in a top 2018 song')
plt.show()
In [17]:
np.array(voc).mean()
Out[17]:
489.15

On average, you should write a 490 words.

Should you repeat words ?

Yeah, Yeah, Yeah...

In [125]:
length = [len(set(word)) for word in words]
In [127]:
round(np.array(length).mean(),2)
Out[127]:
161.91

Yes, Yes, Yes, repeating a word 3 times on average sounds good.

In [128]:
plt.figure(figsize=(12,5))
plt.hist(np.array(length), bins=40)
plt.title('Number of unique words in a top 2018 song')
plt.show()

And if we remove stopwords, are there any words left ?

In [130]:
length_fil = [len(set(word)) for word in filtered]
round(np.array(length_fil).mean(),2)
Out[130]:
45.44

Well, there aren't many words left.

In [129]:
plt.figure(figsize=(12,5))
plt.hist(np.array(length_fil), bins=40)
plt.title('Number of unique words in a top 2018 song, without stopwords')
plt.show()

What are the main words you should use ?

Tough question, it might acutally be the reflect of cool stuffs in our society.

In [133]:
word_cloud = list(itertools.chain.from_iterable(words))
In [134]:
str1 = ' '.join(word_cloud)
In [135]:
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(str(str1))
In [136]:
plt.figure(figsize=(15,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
In [137]:
wordcloud.to_file("wordcloud.png")
Out[137]:
<wordcloud.wordcloud.WordCloud at 0x1a2d03d160>

Nevermind.

Should you write a positive or a negative song ?

In [138]:
filename = 'model_sentiment_analysis.sav'
loaded_model = pickle.load(open(filename, 'rb'))
In [139]:
result = loaded_model.predict(songs)
In [140]:
print(result.mean())
0.325

It seems that on average, only 32% of top hits are positve songs. So maybe you should focus on something rather negative. Yeah. Baby. Oh.

What kind of mood should you express ?

In [161]:
filename = 'sentiment_model.sav'
loaded_model = pickle.load(open(filename, 'rb'))
In [162]:
result = loaded_model.predict(songs)
In [166]:
unique, counts = np.unique(result, return_counts=True)
In [168]:
print(unique, counts)
[0 1 2 3] [37 24 13 46]

The 0 corresponds to angry, 1 to sad, 2 to happy, 3 to relax. Therefore, the song should rather be relax.

What kind of music should you make ?

Here we send requests to the Spotify API in order to retrieve specific information on songs.

  • duration_ms : The duration of the track in milliseconds.

  • key : The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

  • mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
  • time_signature : An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
  • acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. The distribution of values for this feature look like this: Acousticness distribution
  • danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. The distribution of values for this feature look like this: Danceability distribution
  • energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. The distribution of values for this feature look like this: Energy distribution
  • instrumentalness :Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. The distribution of values for this feature look like this: Instrumentalness distribution
  • liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. The distribution of values for this feature look like this: Liveness distribution
  • loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db. The distribution of values for this feature look like this: Loudness distribution
  • speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. The distribution of values for this feature look like this: Speechiness distribution
  • valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). The distribution of values for this feature look like this: Valence distribution tempo float The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. The distribution of values for this feature look like this: Tempo distribution
  • tempo : The overall estimated tempo of the section in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • key : The estimated overall key of the section. The values in this field ranging from 0 to 11 mapping to pitches using standard Pitch Class notation (E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on). If no key was detected, the value is -1.
  • mode:integer Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. This field will contain a 0 for “minor”, a 1 for “major”, or a -1 for no result. Note that the major key (e.g. C major) could more likely be confused with the minor key at 3 semitones lower (e.g. A minor) as both keys carry the same pitches.
  • mode_confidence: The confidence, from 0.0 to 1.0, of the reliability of the mode.
  • time_signature : An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of “3/4”, to “7/4”.
In [155]:
import requests, json, logging
import pandas as pd
import base64
import six

def get_info(song_name = 'africa', artist_name = 'toto', req_type = 'track'):
    client_id = 'c06f3a0b980d4bc4b6e52145b4d5e619'
    client_secret = 'dd2dd34dc3c74076a14c31801058d05a'
    auth_header = {'Authorization' : 'Basic %s' % base64.b64encode(six.text_type(client_id + ':' + client_secret).encode('ascii')).decode('ascii')}
    r = requests.post('https://accounts.spotify.com/api/token', headers = auth_header, data= {'grant_type': 'client_credentials'})
    token = 'Bearer {}'.format(r.json()['access_token'])
    headers = {'Authorization': token, "Accept": 'application/json', 'Content-Type': "application/json"}
    
    payload = {"q" : "artist:{} track:{}".format(artist_name, song_name), "type": req_type, "limit": "1"}
    
    res = requests.get('https://api.spotify.com/v1/search', params = payload, headers = headers)
    res = res.json()['tracks']['items'][0]
    year = res['album']['release_date'][:4]
    month = res['album']['release_date'][5:7]
    day = res['album']['release_date'][8:10]
    artist_id = res['artists'][0]['id']
    artist_name = res['artists'][0]['name'].lower()
    song_name = res['name'].lower()
    track_id = res['id']
    track_pop = res['popularity']

    res = requests.get('https://api.spotify.com/v1/audio-analysis/{}'.format(track_id), headers = headers)
    res = res.json()['track']
    duration = res['duration']
    end_fade = res['end_of_fade_in']
    key = res['key']
    key_con = res['key_confidence']
    mode = res['mode']
    mode_con = res['mode_confidence']
    start_fade = res['start_of_fade_out']
    temp = res['tempo']
    time_sig = res['time_signature']
    time_sig_con = res['time_signature_confidence']
    
    res = requests.get('https://api.spotify.com/v1/audio-features/{}'.format(track_id), headers = headers)
    res = res.json()
    acousticness =  res['acousticness']
    danceability = res['danceability']
    energy = res['energy']
    instrumentalness = res['instrumentalness']
    liveness = res['liveness']
    loudness = res['loudness']
    speechiness = res['speechiness']
    valence = res['valence']
    
    res = requests.get('https://api.spotify.com/v1/artists/{}'.format(artist_id), headers = headers)
    artist_hot = res.json()['popularity']/100
    
    return pd.Series([artist_name, song_name, duration, key,mode,temp,artist_hot,end_fade, start_fade, mode_con,key_con,time_sig,time_sig_con,acousticness,danceability,energy ,instrumentalness,liveness,loudness,speechiness,valence, year, month, day, track_pop], index = ['artist_name', 'song_name', 'duration','key','mode','tempo','artist_hotttnesss','end_of_fade_in','start_of_fade_out','mode_confidence','key_confidence','time_signature','time_signature_confidence','acousticness','danceability','energy' ,'instrumentalness','liveness','loudness','speechiness','valence','year','month', 'day', 'track_popularity'])

This function tests if a song request through the API is successful.

In [156]:
def test(song_name = 'africa', artist_name = 'toto', req_type = 'track'):
    client_id = 'c06f3a0b980d4bc4b6e52145b4d5e619'
    client_secret = 'dd2dd34dc3c74076a14c31801058d05a'
    auth_header = {'Authorization' : 'Basic %s' % base64.b64encode(six.text_type(client_id + ':' + client_secret).encode('ascii')).decode('ascii')}
    r = requests.post('https://accounts.spotify.com/api/token', headers = auth_header, data= {'grant_type': 'client_credentials'})
    token = 'Bearer {}'.format(r.json()['access_token'])
    headers = {'Authorization': token, "Accept": 'application/json', 'Content-Type': "application/json"}
    
    payload = {"q" : "artist:{} track:{}".format(artist_name, song_name), "type": req_type, "limit": "1"}
    
    res = requests.get('https://api.spotify.com/v1/search', params = payload, headers = headers)
    if not res.json()['tracks']['items']:
        return False
    else:
        return True

This part of the code iterates over our dataset of hit songs (.csv file) in order to request the Spotify API and retrieve the audio features. Everything is gathered in a single dataframe, and we create a feature by combining the mode confidence and the mode together.

In [ ]:
song_list = pd.read_csv('/Users/raphaellederman/Downloads/Tracks_Hackathon_treated (4).csv', sep = ';')
print(type(song_list['Track Name']))
rows= []
features = ['artist_name', 'song_name', 'duration','key','mode','tempo','artist_hotttnesss','end_of_fade_in','start_of_fade_out','mode_confidence','key_confidence','time_signature','time_signature_confidence','acousticness','danceability','energy' ,'instrumentalness','liveness','loudness','speechiness','valence','year','month', 'day', 'track_popularity']
for index, row in song_list.iterrows():
    print(row['Track Name'].replace('\'','') + ' - ' + row['Artist'])
    if test(row['Track Name'].replace('\'',''), row['Artist'], req_type = 'track') == True :
        rows.append(get_info(row['Track Name'].replace('\'',''), row['Artist'], req_type = 'track'))
data = pd.DataFrame(rows, columns=features)
data['mode_confidence'] = np.where(data['mode'] == 1, data['mode']* data['mode_confidence'], (data['mode']- 1)* data['mode_confidence'])
data = data.drop('mode', axis=1)

Next is a bit of data cleaning.

In [ ]:
data_songs = data_songs.reset_index().drop(['index', 'artist_name', 'song_name'], axis=1).replace('', np.nan).dropna()

Now we use statistical tools to analyze the data

What are the most important features ?

In [ ]:
data = pd.read_csv('/Users/anthonyhoudaille/Desktop/HACKATHON_30_11_2018/data_final.csv')
In [5]:
print(data.head())
   Unnamed: 0       artist_name                                song_name  \
0           0        the weeknd                                  starboy   
1           1  the chainsmokers                                   closer   
2           2      clean bandit  rockabye (feat. sean paul & anne-marie)   
3           3          dj snake                          let me love you   
4           4          maroon 5  don't wanna know (feat. kendrick lamar)   

    duration  key    tempo  artist_hotttnesss  end_of_fade_in  \
0  230.45333    7  186.005               0.90         2.85025   
1  244.96000    8   95.010               0.89         0.17914   
2  251.08816    9  101.965               0.85         0.00000   
3  205.94667    8  100.023               0.89         0.14526   
4  214.26531    7  100.047               0.90         0.20304   

   start_of_fade_out  mode_confidence        ...         energy  \
0          221.72735            0.608        ...          0.588   
1          231.20109            0.618        ...          0.524   
2          245.78322           -0.525        ...          0.763   
3          199.16916            0.590        ...          0.713   
4          210.46567            0.284        ...          0.610   

   instrumentalness  liveness  loudness  speechiness  valence  year  month  \
0          0.000006    0.1370    -7.015       0.2760    0.486  2016   11.0   
1          0.000000    0.1110    -5.599       0.0338    0.661  2016    7.0   
2          0.000000    0.1800    -4.068       0.0523    0.742  2016   10.0   
3          0.000010    0.1440    -5.311       0.0368    0.152  2016    8.0   
4          0.000000    0.0983    -6.124       0.0696    0.418  2018    6.0   

    day  track_popularity  
0  25.0                83  
1  29.0                85  
2  21.0                81  
3   5.0                82  
4  15.0                71  

[5 rows x 25 columns]
In [19]:
# Correlation Matrix 
import seaborn as sn
fig = plt.figure(figsize=(20, 20))
sn.heatmap(data.corr(),  annot=True)
plt.title('Correlation of every features', fontsize=20)
Out[19]:
Text(0.5,1,'Correlation des MidTerm')

We are trying to predict how likely the artist is to be considered as a hot artist in 2019. Therefore, we focus on the variables that have a correlation of more than 0.1 or less than -0.1 with the artist Hotness.

In [ ]:
s = pd.Series(data['month']).dropna()
fig, ax = plt.subplots(figsize = (12,12))
ax.hist(s, alpha=0.8, color='blue', bins = 25)
ax.xaxis.set_ticks(range(13))
ax.xaxis.set_ticklabels( [' ','Janvier', 'Fevrier', 'Mars', 'Avril', 'Mai', 'Juin', 'Juillet', 'Aout', 'Septembre', 'Octobre', 'Novembre', 'Decembre'])
plt.title("Historgram of the number of hit by month")
In [108]:
features = [ 'duration', 'tempo', 'danceability', 'end_of_fade_in', 'start_of_fade_out','energy', 'speechiness','valence','track_popularity']
i=1
for feature in features : 
    plt.figure(figsize=(15,15))
    plt.subplot(3,3, i )
    plt.scatter(data['artist_hotttnesss'], data[feature])
    plt.title("correlation between artist_hotttnesss and " + feature)
    i+=1
In [95]:
plt.figure(figsize = (10,10))
plt.scatter( data['speechiness'], data['artist_hotttnesss'])
plt.title("Correlation between speechiness and Popularity")
Out[95]:
Text(0.5,1,'Correlation between speechiness and Popularity')

For example, the speechiness seems to be correlated to the artiss hotness. Indeed, this would mean that rap music, in which the speechiness is the most important, is a popular music, which is the case.

Are we able to build a prediction from the features we extracted ?

In [378]:
features_columns = [col for col in data_songs.drop("track_popularity", axis = 1).columns]
X = data_songs[features_columns].apply(pd.to_numeric, errors='coerce')
y = data_songs['track_popularity'].apply(pd.to_numeric, errors='coerce')

# Split the data in order to compute the accuracy score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
In [382]:
# Construct a plot showing the most important features in the dataset using a Random Forest Classifier
rnd_clf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)

importances_rf = rnd_clf.feature_importances_
indices_rf = np.argsort(importances_rf)

n = len(indices_rf)
sorted_features_rf = [0] * n;  
for i in range(0,n): 
    sorted_features_rf[indices_rf[i]] = features_columns[i] 

plt.figure(figsize=(140,120) )
plt.title('Random Forest Features Importance')
plt.barh(range(len(indices_rf)), importances_rf[indices_rf], color='b', align='center')
plt.yticks(range(len(indices_rf)), sorted_features_rf)
plt.xlabel('Relative Importance')
plt.tick_params(axis='both', which='major', labelsize=100)
plt.tick_params(axis='both', which='minor', labelsize=100)

plt.show()
In [395]:
def accuracy_score(y_test, predictions):
    correct = 0
    for x in range(len(y_test)):
        if y_test[x] == predictions[x]:
            correct += 1
    return (correct/float(len(y_test))) * 100.0

In the following lines, we try several different models, droping some of the most irrelevant features.

In [467]:
worst_features = ['instrumentalness', 'time_signature_confidence', 'time_signature', 'energy','key_confidence', 'end_of_fade_in']
In [472]:
# We fit the model on the whole train dataset
clf = RandomForestRegressor(n_estimators = 100,max_depth=1000, bootstrap= True, n_jobs=4)
model = clf.fit(X_train.drop(worst_features, axis=1), y_train)
pred = model.predict(X_test.drop(worst_features, axis=1))
score_rf = metrics.r2_score(y_test, pred)
print(score_rf)
-0.040353767623977355
In [471]:
# We fit the model on the whole train dataset
clf = LinearRegression()
model = clf.fit(X_train.drop(worst_features, axis=1), y_train)
pred = model.predict(X_test.drop(worst_features, axis=1))
score_reg = metrics.r2_score(y_test, pred)
print(score_reg)
0.0855710983683905
In [478]:
lasso_regression = LassoCV(cv=4, alphas = range(1,10),random_state=0)
model = clf.fit(X_train.drop(worst_features, axis=1), y_train)
pred = model.predict(X_test.drop(worst_features, axis=1))
score_lasso= metrics.r2_score(y_test, pred)
print(score_lasso)
-0.04124335540356605

Our best model is the gradient boosting, with approximately 26% of r2 score

In [610]:
# We fit the model on the whole train dataset
clf = xgboost.XGBRegressor(colsample_bytree = 0.44, n_estimators=30000, learning_rate=0.07,max_depth=9,alpha = 5)
model = clf.fit(X_train.drop(worst_features, axis=1), y_train)
pred = model.predict(X_test.drop(worst_features, axis=1))
score_rf = metrics.r2_score(y_test, pred)
print(score_rf)
0.2603486640030881

We observe that we are able to predict 21% of the popularity variance based on the audio features extracted from the Spotify API.

In [617]:
data_songs.describe()
Out[617]:
duration key tempo artist_hotttnesss end_of_fade_in start_of_fade_out mode_confidence key_confidence time_signature time_signature_confidence acousticness danceability energy instrumentalness liveness loudness speechiness valence track_popularity
count 934.000000 934.000000 934.000000 934.000000 934.000000 934.000000 934.000000 934.000000 934.00000 934.000000 934.000000 934.000000 934.000000 934.000000 934.000000 934.00000 934.000000 934.000000 934.000000
mean 212.624056 5.140257 121.445416 0.863490 0.475225 205.515999 0.096741 0.448511 3.98394 0.951100 0.196375 0.695937 0.643866 0.006926 0.173738 -6.08718 0.125304 0.468799 75.533191
std 41.810051 3.699400 29.502172 0.084596 1.244497 41.465818 0.510261 0.216680 0.20382 0.118596 0.212415 0.130587 0.161653 0.048293 0.131171 2.23869 0.111350 0.216823 9.868507
min 73.813330 0.000000 54.747000 0.490000 0.000000 63.239550 -0.891000 0.000000 1.00000 0.107000 0.000031 0.258000 0.054900 0.000000 0.021500 -20.51400 0.023200 0.037100 0.000000
25% 189.176665 1.000000 97.978250 0.820000 0.000000 181.908028 -0.461750 0.305250 4.00000 0.972000 0.036775 0.617000 0.539000 0.000000 0.095225 -7.27325 0.047150 0.293000 70.000000
50% 209.446330 5.000000 119.913500 0.870000 0.118275 202.135515 0.351000 0.451500 4.00000 1.000000 0.120500 0.708000 0.658000 0.000000 0.120000 -5.80050 0.075050 0.460000 76.000000
75% 232.746805 8.000000 141.963500 0.920000 0.255370 225.830020 0.544000 0.593000 4.00000 1.000000 0.272750 0.781000 0.768000 0.000015 0.210000 -4.54375 0.167000 0.629500 81.000000
max 460.573330 11.000000 212.058000 1.000000 23.243170 456.602990 1.000000 1.000000 5.00000 1.000000 0.983000 0.968000 0.963000 0.843000 0.866000 0.17500 0.740000 0.969000 99.000000

Making a video clip on YouTube, worth it ?

In [148]:
website = "https://socialblade.com/youtube/top/category/music/mostviewed"

res = requests.get(website)
soup = _handle_request_result_and_build_soup(res)

datas1 = soup.find_all('div', style="width: 860px; background: #fafafa; padding: 10px 20px; color:#444; font-size: 10pt; border-bottom: 1px solid #eee; line-height: 40px;")
datas2 = soup.find_all('div', style="width: 860px; background: #f8f8f8;; padding: 10px 20px; color:#444; font-size: 10pt; border-bottom: 1px solid #eee; line-height: 40px;")
datas3 = soup.find_all('div', style="width: 860px; background: #fafafa; padding: 0px 20px; color:#444; font-size: 10pt; border-bottom: 1px solid #eee; line-height: 30px;")
datas4 = soup.find_all('div', style="width: 860px; background: #f8f8f8;; padding: 0px 20px; color:#444; font-size: 10pt; border-bottom: 1px solid #eee; line-height: 30px;")
In [149]:
channels_metrics = {}

for data in datas1:  
    
    channel = data.find("div", style="float: left; width: 350px; line-height: 25px;")
    chan = channel.text.strip()
    
    folowers = data.find('div', style="float: left; width: 150px;")
    folowers = int(folowers.text.strip().replace(',', ''))
    
    channels_metrics[chan] = folowers
    
for data in datas2:
    
    channel = data.find('div', style='float: left; width: 350px; line-height: 25px;')
    chan = channel.text.strip()
    folowers = data.find('div', style="float: left; width: 150px;")
    folowers = int(folowers.text.strip().replace(',', ''))
    channels_metrics[chan] = folowers
    
for data in datas3:
    
    channel = data.find('div', style='float: left; width: 350px; line-height: 25px;')
    chan = channel.text.strip()
    folowers = data.find('div', style="float: left; width: 150px;")
    folowers = int(folowers.text.strip().replace(',', ''))
    channels_metrics[chan] = folowers

for data in datas4:
    
    channel = data.find('div', style='float: left; width: 350px; line-height: 25px;')
    chan = channel.text.strip()
    folowers = data.find('div', style="float: left; width: 150px;")
    
    try:
        folowers = int(folowers.text.strip().replace(',', ''))
    except ValueError as verr:
        folowers = 0
    
    channels_metrics[chan] = folowers
In [150]:
# Subscribers repartition
df_channels = pd.DataFrame(list(channels_metrics.items()), columns=['Channel', 'Followers'])

plt.figure(figsize=(15, 6))
fol_list = df_channels.sort_values('Followers', ascending=0)['Followers'].tolist()
plt.plot(fol_list)
plt.xlabel("Top250 Music Channels")
plt.ylabel("Nb of suscribers")
plt.show()

This corresponds to the number of subscribers on the 250 biggest Youtube Channels. That looks light quite many people.

In [151]:
countries = ['us', 'de', 'nl', 'in', 'au', 'gb', 'fr', 'cn', 'br', 'ru', 'it', 'es', 'ca', 'ar', 'pt', 'jp', 'lt', 'lu']
music_rate = {}

website_base = "https://socialblade.com/youtube/top/country/"

for country in countries:
    website = website_base + country
    res = requests.get(website)
    soup = _handle_request_result_and_build_soup(res)
    
    datas5 = soup.find_all('i', style="color:#aaa; padding-left: 5px;")
    
    genres = []
    for data in datas5:
        genres.append(data.get("title"))
    
    cnt = genres.count("Category: music")
    music_rate[country] = cnt * 100 / 250

music_rate
Out[151]:
{'ar': 28.0,
 'au': 14.8,
 'br': 22.4,
 'ca': 10.4,
 'cn': 4.4,
 'de': 16.4,
 'es': 14.4,
 'fr': 25.6,
 'gb': 20.0,
 'in': 21.6,
 'it': 19.6,
 'jp': 9.6,
 'lt': 34.4,
 'lu': 20.0,
 'nl': 26.8,
 'pt': 24.4,
 'ru': 8.4,
 'us': 27.2}
In [153]:
plt.figure(figsize=(15, 6))
plt.bar(range(len(music_rate)), list(music_rate.values()), align='center')
plt.xticks(range(len(music_rate)), list(music_rate.keys()))
plt.title('Number of music channels in the top 250 Youtube Channels per country')
plt.show()

When should the song be released ?

In [80]:
s = pd.Series(data['month']).dropna()
fig, ax = plt.subplots(figsize = (12,12))
ax.hist(s, alpha=0.8, color='blue', bins = 25)
ax.xaxis.set_ticks(range(13))
ax.xaxis.set_ticklabels( [' ','Janvier', 'Fevrier', 'Mars', 'Avril', 'Mai', 'Juin', 'Juillet', 'Aout', 'Septembre', 'Octobre', 'Novembre', 'Decembre'])
plt.title("Historgram of the number of hit by month")
Out[80]:
Text(0.5,1,'Historgram of the number of hit by month')

Based on the realease dates in the Spotify API, the best release date should be March.

Conclusion

The song should be :

  • sang by an american
  • focus on the US market
  • include a featuring with XXX Tentacion or Post Malone e.g
  • be rather rap music
  • be 500 words long
  • include keywords : Baby, Yeah, Know, got, Love
  • be rather negative, but still relax
  • have 121 BPM
  • be danceable
  • with a key mode : Sol
  • be published on YouTube, especially in Italy or US
  • and be released next March

Prediction : Ariana Grande + Post Malone

Send report

In [160]:
%%javascript
require(["base/js/namespace"],function(Jupyter) {
    Jupyter.notebook.save_checkpoint();
});
In [166]:
import os
os.system("jupyter nbconvert --to slides baseline-college.ipynb")
#latest modif
Out[166]:
0
In [167]:
import os 
import smtplib

from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email.mime.image import MIMEImage
from email import encoders

with open('Login.txt', 'r') as file:
    gmail_pwd = file.read()

#Set up crap for the attachments
#files = "/tmp/test/dbfiles"
filenames = ["Summary.html", "map.html"]
#print filenames


#Set up users for email
gmail_user = "mael.fabien@gmail.com"
#gmail_pwd = open('Login.txt', 'r')
recipients = ['anatoli.db@gmail.com','anthonyhoudaille@gmail.com', 'raphael.lederman@wanadoo.fr', 'alexandre.bec@telecom-paristech.fr']

#Create Module
def mail(to, subject, text, attach):
    msg = MIMEMultipart()
    msg['From'] = gmail_user
    msg['To'] = ", ".join(recipients)
    msg['Subject'] = subject

    msg.attach(MIMEText(text))

    #get all the attachments
    for file in filenames:
        part = MIMEBase('application', 'octet-stream')
        part.set_payload(open(file, 'rb').read())
        encoders.encode_base64(part)
        part.add_header('Content-Disposition', 'attachment; filename="%s"' % file)
        msg.attach(part)

    mailServer = smtplib.SMTP("smtp.gmail.com", 587)
    mailServer.ehlo()
    mailServer.starttls()
    mailServer.ehlo()
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to, msg.as_string())
    # Should be mailServer.quit(), but that crashes...
    mailServer.close()

body = '''Bonjour Charles, 


Vous trouverez en pièce jointe de ce mail notre travail pour le Hackathon du MS Big Data.


Meilleures salutations,


Anatoli, Raphaël, Anthony, Alexandre, Maël 

'''
#send it
mail(recipients,"Hackathon",body,filenames)

<iframe src = 'fatal.png'>